Resources
Code
Approaches
Resampling
Oversampling
Undersampling
- Down-sampling involves randomly removing observations from the majority class so that its signal does not dominate the learning algorithm. The most common heuristic is sampling without replacement; see the sketch after this list.
- https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/
- Cluster. Cluster centroids is a method that replaces each cluster of majority-class samples with the centroid found by a K-means algorithm, where the number of clusters is set by the desired level of undersampling.
- Tomek links. Tomek links are pairs of instances of opposite classes that are each other’s nearest neighbors. Tomek’s algorithm looks for such pairs and removes the majority-class instance of each, trimming the unwanted overlap between classes until all minimally distanced nearest-neighbor pairs belong to the same class.
- Throw away minority examples and switch to an anomaly detection framework
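A minimal sketch of random down-sampling using scikit-learn’s `resample` utility on toy data; the cluster-centroid and Tomek-link variants in the trailing comments assume the third-party imbalanced-learn package is installed.

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data: 900 majority (class 0) vs. 100 minority (class 1).
rng = np.random.RandomState(0)
X = rng.randn(1000, 2)
y = np.array([0] * 900 + [1] * 100)

X_maj, X_min = X[y == 0], X[y == 1]

# Random undersampling: draw majority samples without replacement
# until both classes have the same size.
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min),
                      random_state=0)

X_bal = np.vstack([X_maj_down, X_min])
y_bal = np.array([0] * len(X_maj_down) + [1] * len(X_min))

# Cluster centroids and Tomek links (assumes imbalanced-learn is installed):
# from imblearn.under_sampling import ClusterCentroids, TomekLinks
# X_cc, y_cc = ClusterCentroids(random_state=0).fit_resample(X, y)
# X_tl, y_tl = TomekLinks().fit_resample(X, y)
```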
Adjust the class importance or the metric
- At the algorithm level, or after it: adjust the class weights (misclassification costs) or adjust the decision threshold. Many machine learning toolkits have ways to adjust the “importance” of classes, e.g. scikit-learn classifiers that take an optional class_weight parameter; see the threshold sketch after this list.
- Change the metric.
- Evaluating the classifier: Accuracy is not a good metric for imbalanced classes!!
- Use a ROC curve
- Don’t get hard classifications (labels) from your classifier (via predict or score). Instead, get probability estimates via predict_proba (or proba, depending on the toolkit).
- No matter what you do for training, always test on the natural (stratified) distribution your classifier is going to operate upon. See sklearn.model_selection.StratifiedKFold (formerly sklearn.cross_validation.StratifiedKFold).
- For a single metric (value): AUC, F1 (the harmonic mean of precision and recall), or Cohen’s kappa (an evaluation statistic that takes into account how much agreement would be expected by chance); see the evaluation sketch after this list.
- https://medium.com/towards-data-science/what-metrics-should-we-use-on-imbalanced-data-set-precision-recall-roc-e2e79252aeba
- http://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/
- The following performance measures can give more insight into a model than traditional classification accuracy:
- Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (which classes the incorrect predictions were assigned to).
- Precision: A measure of a classifier’s exactness (how many of the predicted positives are truly positive).
- Recall: A measure of a classifier’s completeness (how many of the actual positives are found).
- F1 Score (or F-score): The harmonic mean of precision and recall.
- Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
- ROC Curves: Like precision and recall, accuracy is split into sensitivity and specificity, and models can be chosen based on how they balance these values across decision thresholds.
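A hedged sketch of the class-weight and decision-threshold adjustments mentioned above, using a scikit-learn logistic regression on synthetic data; the 0.3 cutoff is purely illustrative and should be tuned on a validation set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# class_weight='balanced' raises the misclassification cost of the
# minority class in inverse proportion to its frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Work with probabilities rather than hard labels, and choose the
# decision threshold yourself instead of accepting the default 0.5.
proba = clf.predict_proba(X_test)[:, 1]
threshold = 0.3  # illustrative only; tune on held-out data
y_pred = (proba >= threshold).astype(int)
```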
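And a sketch of evaluating on the natural class distribution with stratified folds, computing the single-value metrics listed above (ROC AUC from probability estimates; F1, Cohen’s kappa, and the confusion matrix from hard labels).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (cohen_kappa_score, confusion_matrix,
                             f1_score, roc_auc_score)
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

# Stratified folds preserve the natural class ratio in every split.
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])

    proba = clf.predict_proba(X[test_idx])[:, 1]  # scores for ROC AUC
    y_pred = (proba >= 0.5).astype(int)           # labels for F1 / kappa

    print("AUC:   %.3f" % roc_auc_score(y[test_idx], proba))
    print("F1:    %.3f" % f1_score(y[test_idx], y_pred))
    print("Kappa: %.3f" % cohen_kappa_score(y[test_idx], y_pred))
    print(confusion_matrix(y[test_idx], y_pred))
```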
Cost-sensitive training
- Cost-Sensitive Training. For this tactic we use penalized learning algorithms that increase the cost of classification mistakes on the minority class. A popular algorithm for this technique is penalized SVM: during training, pass the argument class_weight='balanced' to penalize mistakes on the minority class by an amount proportional to how under-represented it is, as in the sketch below.
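A minimal sketch of the penalized-SVM idea in scikit-learn; with class_weight='balanced' the per-class penalty is scaled by n_samples / (n_classes * class_count), so minority mistakes here cost roughly nine times more.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic imbalanced problem: roughly 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so errors on the under-represented class are penalized more heavily.
svm = SVC(kernel="rbf", class_weight="balanced")
svm.fit(X, y)
```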
Select or create a suitable algorithm